Slurm (quick guide)

This page is the quickest path to running jobs on the DGX.

Available partitions

Partition      Max walltime   Intended usage                                             Command style
interactive10  02:00:00       Interactive debugging and quick tests on one 1g.10gb MIG   srun
prod10         24:00:00       Batch jobs on one 1g.10gb MIG                              sbatch
prod40         24:00:00       Batch jobs on one 3g.40gb MIG                              sbatch
prod80         24:00:00       Batch jobs on one full A100 80GB GPU                       sbatch

In beginner terms: start with a standard GPU slice (10 GB VRAM; interactive10/prod10), and move to a large slice (40 GB VRAM; prod40) or a full GPU (80 GB VRAM; prod80) if your model does not fit.

The scheduler applies partition defaults for GPU type, task count and CPU count, so beginner commands can stay short.

Basic workflow

1) Quick Python test in current shell (CPU only, resource-limited)

You can run Python directly after login for quick checks:

python3 -c "import sys; print(sys.version)"
python3 my_script.py

This login environment is resource-limited by policy. Use it only for small tests and setup tasks.
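As an example of such a small test, the sketch below expands the one-liner above into a short standard-library-only script (the filename quick_check.py and the function name env_summary are placeholders chosen here, not part of any installed tooling):

```python
# quick_check.py -- small CPU-only sanity check for the login shell.
# Uses only the standard library, so it is safe to run before any
# virtual environment exists.
import os
import platform
import sys

def env_summary():
    """Return a short summary of the interpreter and host."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "cpus": os.cpu_count(),
    }

if __name__ == "__main__":
    for key, value in env_summary().items():
        print(f"{key}: {value}")
```

Run it with python3 quick_check.py; it prints the interpreter version, the OS, and the CPU count visible to the process.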

2) Create and use a virtual environment

python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install numpy torch

Then run Python as usual:

python train.py
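If you are unsure whether a script is actually running inside the virtual environment, a small standard-library check works; this is a sketch, and the name in_venv is chosen here for illustration:

```python
# check_env.py -- report whether Python is running inside a virtual
# environment, based on the interpreter prefix.
import sys

def in_venv():
    """True when sys.prefix differs from the base installation prefix."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)

if __name__ == "__main__":
    print("virtualenv active:", in_venv())
    print("interpreter:", sys.executable)
```

After source venv/bin/activate, "virtualenv active: True" should be printed and the interpreter path should point inside the venv directory.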

3) Open an interactive GPU session for debugging

srun -p interactive10 --time=01:00:00 --pty bash

Inside that shell, activate your virtual environment and run tests:

source venv/bin/activate
python train.py
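Inside an srun session, Slurm exports job information as environment variables, which is a quick way to confirm what the scheduler actually allocated. The variable names below are standard Slurm output variables; the helper slurm_info is a name chosen here, and the values depend on your job:

```python
# job_info.py -- print common Slurm environment variables inside a job.
import os

def slurm_info(environ=os.environ):
    """Collect common SLURM_* variables; 'unset' outside a Slurm job."""
    keys = (
        "SLURM_JOB_ID",
        "SLURM_JOB_PARTITION",
        "SLURM_CPUS_ON_NODE",
        "CUDA_VISIBLE_DEVICES",
    )
    return {key: environ.get(key, "unset") for key in keys}

if __name__ == "__main__":
    for key, value in slurm_info().items():
        print(f"{key}={value}")
```

Run outside a job, every variable prints as "unset"; inside an interactive10 session you should see your job ID, the partition name, and the GPU device made visible to the job.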

4) Launch production jobs with sbatch

Recommended workflow (start with prod10):

cd ~/my_project
cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch
sbatch job.sbatch

For reference, a minimal one-line submit also works:

sbatch -p prod10 --time=04:00:00 --wrap="bash -lc 'source venv/bin/activate && python train.py'"

slurm-prod10.sbatch is a basic template installed for users from /etc/skel. If your account predates this change, copy it manually once:

cp /etc/skel/slurm-prod10.sbatch ~/
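For orientation, a minimal slurm-prod10.sbatch might look like the sketch below. The #SBATCH directives are standard Slurm; the job name, log pattern, and project path are placeholders, and the actual template shipped in /etc/skel on this system may differ:

```shell
#!/bin/bash
#SBATCH --partition=prod10        # 1g.10gb MIG partition (batch)
#SBATCH --time=04:00:00           # walltime, must stay within 24:00:00
#SBATCH --job-name=my_job         # placeholder job name
#SBATCH --output=%x-%j.out        # log file: <job-name>-<jobid>.out

# Activate the project virtual environment, then run the training script.
source "$HOME/my_project/venv/bin/activate"
python train.py
```

Edit the placeholders, then submit with sbatch job.sbatch as shown above.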

Basic Slurm commands

sinfo                        # list partitions and node states
squeue -u $USER              # list your jobs
scontrol show job <jobid>    # inspect one job
scancel <jobid>              # cancel one job
sacct -j <jobid>             # job accounting/history

Notes

  • interactive10 accepts interactive jobs (srun).
  • prod10, prod40, prod80 are batch-oriented and should be used with sbatch.
  • For exact technical policy (GRES mapping, defaults, limits, quality of service), see Advanced partitions.
  • For the current physical/logical GPU split on this DGX, see GPU and MIG layout.